We propose a model that explains the hierarchical organization of proteins in fold families.
The model, which is based on the evolutionary selection of proteins by their native state stability,
reproduces patterns of amino acids conserved across protein families. Due to its dynamic nature,
the model sheds light on the evolutionary time scales. By studying the relaxation of the correlation
function between consecutive mutations at a given position in proteins, we observe a separation of the
evolutionary time scales: at short time intervals, families of proteins with similar sequences and
structures are formed, while at long time intervals, families of structurally similar proteins that
have low sequence similarity are formed. We discuss the evolutionary implications of our model.
We provide a "profile" solution to our model and find agreement between predicted patterns of
conserved amino acids and those actually observed in nature.

I. INTRODUCTION

Understanding protein evolution remains a major challenge in molecular biology [1]-[13]. While the mechanisms
of mutations in DNA sequences that code for proteins are known [14], the selective fixation of these mutations in
proteins is far from clear. Mutations occurring in DNA are directly governed by physical-chemical processes, and their
fixation depends on whether cellular repair mechanisms preserve the nucleotide(s) from modification. Selection
by evolution is much softer in DNA than in proteins because the structure, function and kinetic properties of a
protein are more strongly interdependent than those of DNA. Since mutations may drastically alter the physical, chemical, and
biological properties of proteins, evolution exerts pressure to preserve those amino acids that play an important role
in the folding kinetics, functionality and stability of proteins. Our goal is to understand evolution from the statistical
mechanics perspective.

There are several principal facts observed in proteins: (i) a protein sequence folds into a unique three-dimensional
structure (there might be exceptions, e.g. prions); (ii) protein sequences are selected, i.e. a randomly chosen
polypeptide most likely aggregates in solution without forming any definite three-dimensional structure; (iii) proteins taken
from various species and having sequence identity, ID, of at least ID = 25 - 30% have similar three-dimensional
structures (native state) [ p^ , |l5| -f2l| and are said to belong to the same fold family; (iv) some pairs of proteins sharing the
same fold have sequence similarity as low as expected for random sequences, ID ~ 8 - 9% |^,^,^3|; (v) within the
same fold family, protein sequences have only 3 - 4% "anchored" amino acids § . A set of proteins that have at least
25% sequence similarity and are structurally similar are called homologs. A set of structurally similar proteins that
may have less than 25% sequence similarity is called a group of structurally homologous proteins or analogs. Analogs
include several families of homologs and generally constitute a larger set of proteins than homologs. Known homologs
and analogs are collected in the HSSP [19] and FSSP [22] databases, respectively, and are the subject of our study.

Here we propose a model of evolution (Z-score model) that, based on facts (i) and (ii), attempts to reproduce the
remaining principal observations (iii)-(v) described above. The Z-score model is based on the design of a
set of structurally identical sequences by the Z-score minimization (2~i|-|2l| . The idea is to find the similarities in the
sequences of such a set and to recover those residues that are conserved across this set. The protein folding theory
|p7| , p8| suggests that Z-score minimization is equivalent to maximizing the energy gap between misfolded or unfolded
conformations and the native state of a protein. It has been pointed out that such maximization results in stable and
fast-folding proteins. Thus, by designing sequences that have the same fold, we attempt to mimic evolution
in diversifying protein sequences for the same fold family. In addition, the Z-score model is a dynamical model, i.e.
there is an implicit time scale that allows one to follow the evolution of sequences during the design procedure. The
model is discussed in detail in Sec. II and a profile approximation to this model is outlined in Sec. III. In Sec. IV we
show that our view of evolution proposed below is consistent with the implications of the proposed model. Next we
discuss our scenario of protein evolution.

We conjecture that hierarchical organization of structurally similar proteins may be the result of the separation of
the evolutionary time scales, shown schematically in Fig. 1. On a time scale $\tau_0$, a set of mutations occurs that does
not affect those amino acids that play thermodynamic, kinetic and/or functional roles. As a result, there is little
variation in sequences at the important sites of proteins. If a mutation occurs at the thermodynamically, kinetically
and/or functionally important sites, it usually substitutes amino acids with close physical properties so that the core,
nucleus and/or functional site are not disrupted and the protein folds into its family fold, is stable in this fold, and
its function is preserved. At this time scale, a family of homologs is born.

Rarely, at time scale $\tau$, correlated mutations occur [30]-[32] that modify several amino acids at the core, nucleus
and/or functional site, so that the stability and kinetics of proteins are not altered. Such a set of mutations can
drastically modify the sequence of the protein. However, within the time scale $\tau_0$, a family of homologs is born within
which there is conservation of (already new) amino acids in the specific (important) sites of homologous proteins.
Although there are alterations in the specific sites of the proteins at the time scale $\tau$, these sites are more preserved
than the rest of the sequence. The proposed view of protein evolution is consistent with the observations of hierarchical
organization of structurally similar proteins in families of homologs. Sets of families of homologs are organized, in
turn, in superfamilies of analogs.



II. Z-SCORE MODEL 



We start with a random protein amino acid sequence and perform a Monte Carlo search for mutations that make
the interactions in this sequence energetically more favorable. The Monte Carlo design algorithm is based on the
minimization of the so-called Z-score, defined as

$$Z = \frac{E_{NS} - \langle E \rangle}{\sigma(E)} , \tag{1}$$

where $E_{NS}$ is the energy of the native state of the selected sequence, $\langle E \rangle$ is the average energy
of structurally unrelated conformations (decoys) [33]-[35], and $\sigma(E)$ is the standard deviation of the
energies of all decoys (see Fig. |J). Since $E_{NS} < \langle E \rangle$ for a stable protein, minimizing $Z$ makes it
more negative, i.e. it widens the energy gap between the native state and the decoys.
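
As a minimal illustration of the Z-score definition, the Python sketch below evaluates it for an explicit list of decoy energies. The function name `z_score` and the numbers are hypothetical and chosen only to show the arithmetic; they are not data from this study.

```python
import statistics

def z_score(e_native, decoy_energies):
    """Z-score: (E_NS - <E>) / sigma(E), evaluated over a decoy ensemble."""
    mean_e = statistics.fmean(decoy_energies)
    sigma_e = statistics.pstdev(decoy_energies)  # spread over all decoys
    if sigma_e == 0.0:
        raise ValueError("decoy energies have zero spread; Z is undefined")
    return (e_native - mean_e) / sigma_e

# Illustrative numbers only: a native state far below a symmetric decoy spectrum.
decoy_energies = [-2.0, -1.0, 0.0, 1.0, 2.0]
z_native = z_score(-6.0, decoy_energies)
```

A strongly negative value of `z_native` means that the native energy lies several standard deviations below the decoy average, which is the "significant" gap the design aims for.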

Since Z-score minimization is equivalent to maximizing the energy gap between misfolded or unfolded conformations
and the native state of the protein |2^j2^,Q, such maximization results in stable and fast-folding proteins. The energy
gap must be "significant", meaning that $E_{NS}$ must lie many standard deviations $\sigma$ below $\langle E \rangle$:
$E_{NS} \ll \langle E \rangle - \sigma$. Many researchers have pointed out (see, e.g., review p^|) that minimization of
the Z-score corresponds to the stabilization of the protein in its native state.

The design proceeds as follows: (i) we select an amino acid $\sigma_i$ at a random position $1 \le i \le N$; (ii) we substitute
this amino acid by $\sigma'_i$ with probability $p$,

$$p = \begin{cases} 1, & \text{if } \delta Z \le 0 , \\ \exp(-\delta Z / T_{des}), & \text{if } \delta Z > 0 , \end{cases} \tag{2}$$

where $\delta Z = Z(\sigma'_i) - Z(\sigma_i)$ is the difference between the Z-scores of the mutated and the original proteins. We design
each of $N_s = 100$ sequences by running the simulations for $N_m$ Monte Carlo steps at some design temperature, $T_{des}$.
Computation of $\langle E \rangle$ and $\sigma(E)$ is straightforward:

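$$\langle E \rangle = \sum_{i<j} U(\sigma_i, \sigma_j) f_{ij} , \tag{3}$$

and

$$\sigma^2(E) = \langle (E - \langle E \rangle)^2 \rangle = \sum_{i<j} f_{ij} (1 - f_{ij}) U^2(\sigma_i, \sigma_j) + \ldots , \tag{4}$$

where the omitted terms are of higher order in $f$, and $f_{ij}$ is the frequency of a contact between monomers $i$ and $j$
in a set of decoys, i.e.

$$f_{ij} = \langle \Delta_{ij} \rangle . \tag{5}$$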
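
The design procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the pair energy `u`, the contact frequency `freq`, the toy contact map and the sequence length are hypothetical placeholders, the sums run over pairs $i < j$, and only the leading term of the variance is kept.

```python
import math
import random

def z_score_of_sequence(seq, native_contacts, freq, u):
    """Z-score with <E> and sigma(E) estimated from contact frequencies f_ij."""
    n = len(seq)
    e_native = sum(u(seq[i], seq[j]) for i, j in native_contacts)
    mean_e = 0.0
    var_e = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            f = freq(i, j)
            e = u(seq[i], seq[j])
            mean_e += e * f
            var_e += f * (1.0 - f) * e * e  # leading term only
    return (e_native - mean_e) / math.sqrt(var_e)

def accept_probability(delta_z, t_des):
    """Always accept if Z does not increase, otherwise use a Boltzmann factor."""
    return 1.0 if delta_z <= 0.0 else math.exp(-delta_z / t_des)

def design(seq, alphabet, native_contacts, freq, u, n_steps, t_des, rng):
    """Run n_steps attempted point mutations; return the sequence and its Z."""
    seq = list(seq)
    z = z_score_of_sequence(seq, native_contacts, freq, u)
    for _ in range(n_steps):
        i = rng.randrange(len(seq))      # (i) pick a random position
        old = seq[i]
        seq[i] = rng.choice(alphabet)    # (ii) propose a substitution
        z_new = z_score_of_sequence(seq, native_contacts, freq, u)
        if rng.random() < accept_probability(z_new - z, t_des):
            z = z_new                    # accept the mutation
        else:
            seq[i] = old                 # reject and restore
    return seq, z

# Hypothetical toy setup: six-letter alphabet, 8 residues, made-up energies.
def u(a, b):
    return -1.0 if a == b else 1.0

def freq(i, j):
    return 0.2 if j - i >= 2 else 0.0

native_contacts = {(0, 3), (1, 4), (2, 5), (3, 6), (4, 7), (0, 7)}
alphabet = "ABCDEF"
rng = random.Random(0)
start = [rng.choice(alphabet) for _ in range(8)]
designed, z_final = design(start, alphabet, native_contacts, freq, u, 200, 0.25, rng)
```

Recomputing the full Z-score after every attempted mutation keeps the sketch short; a production code would more likely update only the terms that involve the mutated position.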

We estimate frequencies of contacts by making two assumptions about the set of decoys: (1) the distribution,
$P(\ell)$, of the contact distances, $\ell = |i - j|$, between various amino acids at the positions $i$ and $j$ is universal
among globular proteins; and (2) the actual frequency of contacts between various amino acids, $i$ and $j$, is only a
function of the absolute value of the length of contacts, $|i - j|$, and is equal to the distribution of the contact lengths,
i.e.

$$f_{ij} = f_{|i-j|} = P(\ell) . \tag{6}$$

The distribution $P(\ell)$ is then

Both assumptions, (1) and (2), are motivated by the fact that the variety of protein structures known to date samples
adequately the conformational space of proteins under study and the variety is large enough that all the information
about the secondary structure peculiarities of individual proteins is averaged out.

In order to estimate frequencies, $f_{ij}$, according to Eq. (Q) we compute the distribution of contacts of length $\ell = |i - j|$
in the ensemble of approximately $10^3$ representative globular proteins in the Protein Data Bank (PDB) |3^ , |37t . The
distribution, shown in Fig. [|, is obtained using the $C_\beta$-representation of proteins. The contacts are defined by Eq. (|2p|).

The estimation of contact frequencies $f_{ij}$ is one of the key ingredients of protein design. An alternative approach
based on sampling of homopolymer conformations appears to be less efficient, so we omit it in the present study.
Nevertheless, due to its importance and potential for other studies, we discuss this approach in Appendix A.

After we obtain $N_s$ designed sequences, we compute the probability of an amino acid $\sigma_k$ to be in the $k$th
position, $P_Z(\sigma_k)$, as the frequency of occurrence of this amino acid,

$$P_Z(\sigma_k) = N(\sigma_k)/N_s , \tag{8}$$

where $N(\sigma_k)$ is the total number of occurrences of an amino acid $\sigma_k$ at the position $k$. Next, using Eq. (22) we
compute the sequence entropy, $S_Z(k)$.
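
The frequency count and the resulting positional entropy can be sketched in Python as follows. The sketch assumes that the entropy referred to above has the usual Shannon form, $-\sum p \ln p$; the function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def position_frequencies(sequences, k):
    """P_Z at position k: occurrences of each amino acid divided by N_s."""
    counts = Counter(seq[k] for seq in sequences)
    n_s = len(sequences)
    return {aa: c / n_s for aa, c in counts.items()}

def sequence_entropy(sequences, k):
    """Shannon entropy (assumed form) of the amino acid distribution at position k."""
    freqs = position_frequencies(sequences, k)
    return -sum(p * math.log(p) for p in freqs.values())

# Toy set of N_s = 4 designed sequences of length 2 (illustrative only).
designed_set = ["AB", "AC", "AB", "AD"]
entropy_first = sequence_entropy(designed_set, 0)   # fully conserved position
entropy_second = sequence_entropy(designed_set, 1)  # variable position
```

A fully conserved position has zero entropy, while a variable position has a positive entropy that grows with the number of amino acid types observed there.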

III. PROFILE SOLUTION 

We develop a profile solution to the Z-score model that provides a rationale for conservatism patterns caused by
selection for stability. Our solution describes equilibrium evolution that maintains stability and other properties achieved
at an earlier, prebiotic stage. To this end we propose that stability selection accepts only those mutations that keep
the energy of the native protein, $E$, below a certain threshold $E_c$ necessary to maintain an energy gap j^j2^,^,^]. The
requirement to maintain an energy threshold for the viable sequences makes the equilibrium ensemble of sequences
analogous to a microcanonical ensemble. In analogy with statistical mechanics, a more convenient and realistic
description of the sequence ensemble is a canonical ensemble, whereby the strict requirement on the energy of the native
state is replaced by a "soft" evolutionary pressure that allows energy fluctuations from sequence to sequence but
makes sequences with high energy in the native state unlikely. In the canonical ensemble of sequences, the probability
of finding a particular sequence, $\{\sigma\}$, in the ensemble follows the Boltzmann distribution ||^4|,[38 40

$$P(\{\sigma\}) = \frac{\exp(-E\{\sigma\}/T)}{Z} , \tag{9}$$

where $T$ is the effective temperature of the canonical ensemble of sequences that serves as a measure of evolutionary
pressure and $Z = \sum_{\{\sigma\}} \exp(-E\{\sigma\}/T)$ is the partition function taken in sequence space.

Next, we apply a profile approximation that replaces all multiparticle interactions between amino acids with the
interaction of each amino acid with an effective field $\Phi$ acting on this amino acid from the rest of the protein, so that each
amino acid experiences the exact field of its neighbors. This approximation presents $P(\{\sigma\})$ in a multiplicative form,
$\prod_{k=1}^{N} p(\sigma_k)$, of probabilities to find an amino acid $\sigma$ at position $k$ |4jj |. Each $p(\sigma_k)$ also obeys Boltzmann statistics,

$$p(\sigma_k) = \frac{\exp(-\Phi(\sigma_k)/T)}{\sum_{\sigma} \exp(-\Phi(\sigma)/T)} . \tag{10}$$

The profile potential $\Phi(\sigma_k)$ is the effective potential energy between amino acid $\sigma_k$ and all amino acids interacting
with it, i.e.

$$\Phi(\sigma_k) = \sum_{i \neq k}^{N} U(\sigma_k, \sigma_i) \Delta_{ki} . \tag{11}$$



The potential is similar in spirit to the protein profile introduced by Bowie et al. [42] to identify protein sequences
that fold into a specific 3D structure.

For each member, $m$, of the fold family (FSSP database |2l|) presented in Fig. [j], we compute the profile probability,
$p_m(\sigma_k)$, using Eq. (10). This probability, $p_m(\sigma_k)$, for each fold family member corresponds to the frequency of amino
acids, $\sigma_k$, at positions, $k$, for a given family of homologs. Then, we compute the average profile probability over all
members of the fold family,

$$p_P(\sigma_k) = \frac{1}{N_s} \sum_{m=1}^{N_s} p_m(\sigma_k) . \tag{12}$$

This quantity corresponds to the $P_{acr}(\sigma_k)$ presented in [30]. Eqs. (10)-(12), along with the properly selected
energy function, $U$, make it possible to predict probabilities of all amino acid types and the sequence entropy $S_P(k)$ at
each position $k$,

$$S_P(k) = - \sum_{\sigma} p_P(\sigma_k) \ln p_P(\sigma_k) , \tag{13}$$

from the native structure of a protein. The summation is taken over all possible values of $\sigma$.
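
The profile calculation can be sketched in Python as follows. The sketch assumes that the profile potential at position $k$ is the sum of pair energies between a candidate amino acid and the residues in native contact with $k$; the two-letter alphabet, the energy function `u` and the neighbor lists are hypothetical placeholders rather than the potential used in this work.

```python
import math

def profile_probabilities(seq, k, neighbors, alphabet, u, temperature):
    """Boltzmann probability of each candidate amino acid at position k."""
    fields = {a: sum(u(a, seq[i]) for i in neighbors[k]) for a in alphabet}
    f_min = min(fields.values())  # shift energies for numerical stability
    weights = {a: math.exp(-(f - f_min) / temperature) for a, f in fields.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

def average_profile(member_seqs, k, neighbors, alphabet, u, temperature):
    """Average of the per-member profile probabilities over the fold family."""
    avg = {a: 0.0 for a in alphabet}
    for seq in member_seqs:
        p = profile_probabilities(seq, k, neighbors, alphabet, u, temperature)
        for a in alphabet:
            avg[a] += p[a] / len(member_seqs)
    return avg

def profile_entropy(p):
    """Sequence entropy -sum p ln p of a profile at one position."""
    return -sum(x * math.log(x) for x in p.values() if x > 0.0)

# Hypothetical toy model: H-H contacts are attractive, all others neutral.
def u(a, b):
    return -1.0 if a == "H" and b == "H" else 0.0

neighbors = {0: [1, 3], 1: [0], 2: [], 3: [0]}  # native contacts per position
family = ["HHPH", "PHPP"]
p_buried = average_profile(family, 0, neighbors, "HP", u, 1.0)
p_exposed = average_profile(family, 2, neighbors, "HP", u, 1.0)
```

In this toy model the position without contacts has a uniform profile and maximal entropy, whereas the position surrounded by attractive partners is biased toward one amino acid type and therefore has lower entropy.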

If stability selection is a factor in the evolution of proteins and our model captures it, then we should observe a
correlation between the predicted profile-based sequence entropies, $S_P(k)$, and the actual sequence entropies, $S_{acr}(k)$, in
real proteins. Thus, the question is: "Can we find such $T$ that the predicted conservatism profile $S_P(k)$ matches
the real one, $S_{acr}(k)$?"

By varying the temperature $T$ in the range $0.1 < T < 4.0$, we minimize the distance,
$D^2 = \sum_{k=1}^{N} (S_P(k) - S_{acr}(k))^2$, between the predicted and observed conservatism profiles. We exclude from this sum
those positions that have more than 50% gaps in the structural (FSSP) alignment.
We denote by $T_{sel}$ the temperature that minimizes $D$.
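
The temperature fit can be sketched in Python as a simple grid search. The grid spacing, the function names and the toy entropy model below are assumptions made for illustration; the text specifies only the temperature range and the gap threshold.

```python
def squared_distance(s_pred, s_obs, gap_fraction, max_gap=0.5):
    """D^2 over positions whose alignment gap fraction does not exceed max_gap."""
    return sum((p - o) ** 2
               for p, o, g in zip(s_pred, s_obs, gap_fraction)
               if g <= max_gap)

def fit_selective_temperature(predict_entropy, s_obs, gap_fraction, temperatures):
    """Return the temperature on the grid that minimizes D^2."""
    return min(temperatures,
               key=lambda t: squared_distance(predict_entropy(t), s_obs, gap_fraction))

# Toy stand-in for the profile calculation: entropies scale linearly with T.
base = [1.0, 2.0, 3.0, 4.0]

def toy_predict(t):
    return [t * x for x in base]

observed = [0.25 * x for x in base]
gaps = [0.0, 0.1, 0.9, 0.2]                            # third position is mostly gaps
grid = [0.1 + 0.05 * i for i in range(79)]             # covers 0.1 to 4.0
t_sel = fit_selective_temperature(toy_predict, observed, gaps, grid)
```

In practice `toy_predict` would be replaced by a function that evaluates the profile entropies at a given temperature.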

The proposed profile solution has a dual role. On one hand, it allows us to understand the selective temperature
scale, $T_{sel}$, which is the measure of evolutionary optimization. On the other hand, the correlation coefficient between
$S_P(k)$ and $S_{acr}(k)$ does not vary strongly for $T_{sel}$ in the range from 0.19 to 0.34, thus allowing one to use the effective
temperature $T_{sel} = 0.25$ to predict the actual conservatism profiles of proteins (see Table I).



IV. RESULTS AND DISCUSSION 



We study five folds: Immunoglobulin fold (Ig), Oligonucleotide-binding fold (OB), Rossmann fold (R), $\alpha/\beta$-plait
($\alpha/\beta$-P), and TIM-barrel fold (TIM). The three-dimensional structures of the representative proteins of these five
folds are shown in Fig. ||: (a) Tenascin (Third Fibronectin Type III Repeat), pdb:1TEN; (b) Major Cold Shock Protein
7.4 (Cspa (Cs 7.4)) of Escherichia coli, pdb:1MJC; (c) chemotactic protein CheY, pdb:3CHY; (d) Acyl-Phosphatase
(Common Type) From Bovine Testis, pdb:2ACY; (e) Endo-Beta-N-Acetylglucosaminidase F1, pdb:2EBN. We compute
the correlation coefficient between values of $S_P(k)$, obtained at $T_{sel}$, and $S_{acr}(k)$ for all five folds. The
results are summarized in Table I. The plots of $S_P(k)$ and $S_{acr}(k)$ versus $k$, as well as their scatter plots, are shown
in Figs. |l|-0.



A. Z-score model 



We find that the correlation between $S_Z(k)$ and $S_{acr}(k)$ strongly depends on the number of mutations, $N_m$, we introduce
during the design of a protein. This fact is in accord with our view (see Fig. 1) of protein evolution. On a short time
scale, $\tau \sim 10^2$ Monte Carlo steps, mutations rarely alter amino acids with specific important properties such as
participation in the stabilization of proteins and/or in the nucleation processes in the folding kinetics of proteins. These
mutations diversify the family, $m$, of homologs, $\mathcal{M}^m$. On a larger scale, $\tau \gg \tau_0$, correlated mutations [30]-[32] modify
core and/or nucleus site(s) of the proteins without compromising their stability, folding rates and function(s). Thus,
at the time scale $\tau$ evolution moves from one family of homologs to another, diversifying the underlying family of
analogs, $\mathcal{M}_a$, $\bigcup_m \mathcal{M}^m \subset \mathcal{M}_a$. The ensemble of analogs is still much smaller than the ensemble, $\mathcal{M}_0$, of all possible
sequences ($\mathcal{M}_a \subset \mathcal{M}_0$), which is of the size $6^N$ (in a 6-letter alphabet); for an $N = 100$ residue protein this number
is of the order of $10^{78}$. These results are in agreement with the theoretical predictions [p|, p7|p9|j4^ -flr| that there is a
large number (of the order e 19Af p9[) of fast folding sequences with a given native structure and pronounced stability
gaps $\Delta = E_{NS} - \langle E \rangle$.
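
As a quick check of the size of sequence space quoted above, the order of magnitude of $6^N$ for $N = 100$ follows from a base-10 logarithm:

```latex
\log_{10} 6^{100} = 100 \, \log_{10} 6 \approx 100 \times 0.778 = 77.8 ,
\qquad \text{so} \qquad 6^{100} \approx 10^{78} .
```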

It is important that for a small number of mutations we find a correlation between the entropies of the designed
sequences, $S_Z(k)$, and the empirically observed ones, $S_{acr}(k)$. This correlation depends on the input random number,
indicating that the selected sequences constitute a family of homologs, $\mathcal{M}^m$, that is closer to or more distant from the
original family of homologs (both families belong to a given family of analogs, $\mathcal{M}_a$). Here
we present the results for the selected ensembles of the designed sequences, $\mathcal{M}^m$, after being optimized during $N_m$
mutations. More important than the correlation between $S_Z(k)$ and $S_{acr}(k)$, we find that the profiles of $S_Z(k)$ and
$S_{acr}(k)$ are in visible concert with each other.

The temperature dependence of the Z-score exhibits a sharp transition at $T = T_c \approx 0.25$ (Fig. |J) for all studied
proteins. Above $T_c$, protein design results in unstable sequences, while at temperatures much lower than $T_c$ many of
the residues "freeze" in their original states. Thus, we select $T_c$ as our design temperature.



B. Degree of divergence of sequences 

To assess the degree of similarity or divergence of sequences in the course of Z-score design at various time scales,
we compute the distribution of Hamming distances at these time scales. The Hamming distance, $Hd(\{\sigma\}^{(1)}, \{\sigma\}^{(2)})$,
between two sequences, $\{\sigma\}^{(1)}$ and $\{\sigma\}^{(2)}$, is defined as the number of positions at which the two sequences
carry different amino acids, divided by the length of the sequences, $N$:

$$Hd(\{\sigma\}^{(1)}, \{\sigma\}^{(2)}) = \frac{1}{N} \sum_{i=1}^{N} \left[ 1 - \delta(\sigma_i^{(1)} - \sigma_i^{(2)}) \right] . \tag{14}$$

Hamming distance has a simple interpretation: it is the degree of divergence between two sequences. When $Hd$ is
equal to 1, the sequences differ at every position; when $Hd$ is equal to 0, the sequences are exactly the same.
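
A minimal Python sketch of the normalized Hamming distance and of the all-against-all comparison used to build a distance distribution is given below. The function names and the toy sequences are illustrative assumptions.

```python
import itertools

def hamming_distance(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def pairwise_distances(sequences):
    """Hamming distances between all distinct pairs of sequences."""
    return [hamming_distance(a, b)
            for a, b in itertools.combinations(sequences, 2)]

# Toy example with three short sequences (illustrative only).
distances = pairwise_distances(["ABCD", "ABXY", "WXYZ"])
```

A histogram of `distances` over a set of designed sequences gives the kind of distribution discussed in this subsection.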

We compute the distribution of Hamming distances between all designed sequences for two design times: (a) $t_d = 10^3 \gg \tau_0$
(Fig. ||(a)) and (b) $t_d = 10^2 \sim \tau_0$ Monte Carlo steps (Fig. ||(b)), where $\tau_0$ is a characteristic time scale.
We use the 1MJC family of homologs (OB fold) as an example throughout this subsection. We also performed a similar
analysis with other folds and the results are qualitatively the same (not shown).

In the computation of the distribution in case (b) we omit all sequences with sequence similarity less than ID = 55%
to mimic sequence collection in the HSSP database. This threshold sequence similarity, ID, is chosen so that the
Hamming distance distribution derived from the actual sequences in the HSSP database (Fig. |(d)) is similar to ours.
Given that we use a six-letter alphabet, the correspondence between the ID used in HSSP and our ID is not well defined.
Because in (b) we select only sequences with minimal threshold similarity, ID, there are no events with $Hd > 0.55$ in
Fig. [|(b). In addition, the events with low $Hd \to 0$ are overrepresented in our simulations since we do not account
for additional pressure due to function or kinetics that exists in real protein sequences. Therefore, the distribution in
our simulations (Fig. ||(b)) has a more pronounced tail at $Hd \to 0$ than that in real proteins (Fig. ||(d)).

At the long time scales (a) we find that most of the sequences are divergent from each other, with average $\langle Hd \rangle \approx 0.7$.
We observe the same result by computing the distribution of the Hamming distances between all analogs belonging to
the OB fold family present in the FSSP database (Fig. ^(c)). The only difference between simulated (Fig. ^(a)) and
observed (Fig. |^(c)) distributions of Hamming distances is the tail present in the simulated distribution corresponding
to the sequences with a significant degree of similarity. This tail is due to the fact that we compare every sequence with
every other sequence, thus effectively including similar sequences in our histogram. In the FSSP database, on the other hand,
only distant sequences are present so that the tail corresponding to the close sequences in Fig. ^(c) is absent.

The distributions of hamming distances in protein families are in qualitative agreement with those observed in 
simulations and with our picture of hierarchical protein evolution. At short time scales sequences are not strongly 
separated from each other forming families of homologs, while at long time scales, a family of analogs is formed, 
comprised of strongly separated sequences but structurally similar proteins. 

C. Determination of the family formation time scale



To quantify our observation of the separation of evolutionary time scales, we compute the relaxation times of the correlation
function (at each protein position, $k$) in the course of Z-score design, defined as

$$C_k(\tau) = \frac{1}{N_s} \sum_{\alpha=1}^{N_s} \frac{1}{t_d} \int_0^{t_d} \chi_k^{(\alpha)}(t, \tau)\, dt = \langle\langle \chi_k(\tau) \rangle\rangle_{t_d, N_s} , \tag{15}$$

where $\langle\langle \ldots \rangle\rangle_{t_d, N_s}$ denotes the average over simulation design time, $t_d$, and the number, $N_s$, of initial sequences.
$\chi_k(t, \tau)$ is a boolean indicator of whether the amino acid $\sigma_k(t + \tau)$ at position $k$ at time $t + \tau$ is the same as the
amino acid $\sigma_k(t)$ at the same position at time $t$:

$$\chi_k(t, \tau) = \begin{cases} 1, & \sigma_k(t) = \sigma_k(t + \tau) , \\ 0, & \sigma_k(t) \neq \sigma_k(t + \tau) . \end{cases} \tag{16}$$

$C_k(\tau)$ measures the probability that a mutation does not occur at the position $k$ in time $\tau$. This function for most
equilibrium systems decays exponentially,

$$C(\tau) \sim \exp(-\tau/\tau_0) , \tag{17}$$

where $\tau_0$ is the relaxation time, i.e. the average time between subsequent mutations.
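
The correlation function for a single position and a simple estimate of its relaxation time can be sketched in Python as follows. The sketch makes two assumptions that are not taken from the text: the time integral is replaced by an average over discrete Monte Carlo steps of one trajectory, and the relaxation time is obtained from a least-squares line through the origin for $\ln C$ versus $\tau$.

```python
import math

def correlation(trajectory, tau):
    """Fraction of times t at which the residue at t equals the residue at t + tau."""
    n = len(trajectory) - tau
    if n <= 0:
        raise ValueError("tau must be shorter than the trajectory")
    return sum(trajectory[t] == trajectory[t + tau] for t in range(n)) / n

def relaxation_time(trajectory, taus):
    """Estimate tau_0 from ln C(tau) = -tau / tau_0 by a fit through the origin."""
    num = 0.0
    den = 0.0
    for tau in taus:
        c = correlation(trajectory, tau)
        if c > 0.0:                    # skip lags where ln C is undefined
            num += tau * math.log(c)
            den += tau * tau
    if num == 0.0:
        return math.inf                # no decay observed: fully conserved position
    return -den / num

# Toy trajectory of the residue at one position (illustrative only).
toy_trajectory = "AAAABBBB"
c_one_step = correlation(toy_trajectory, 1)
tau_zero = relaxation_time(toy_trajectory, [1])
```

With a finite alphabet the correlation function levels off at a nonzero plateau at long lags, so in practice only the initial decay should enter such a fit.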

We also find that the correlation function computed for $N_s = 10^3$ and for $t_d = 10^3$ decays exponentially (see Fig 0)
and that the relaxation times $\tau_0$ depend strongly on the positions of the amino acids under consideration. For example, the
relaxation times of the correlation functions for positions 1 (Ser in 1MJC) and 31 (Val) in the 1MJC design differ by more than
a factor of two: $\tau_0(\mathrm{Ser1}) = 143$ and $\tau_0(\mathrm{Val31}) = 387$ Monte Carlo steps. The fact that $\tau_0(\mathrm{Val31})$ is more than two
times larger than $\tau_0(\mathrm{Ser1})$ indicates that Ser1 is likely to mutate more than twice in the time-span of a single Val31
mutation.

In addition, the distribution of the relaxation times (see Fig. ^) exhibits a pronounced peak at $\tau_0 = 170$ Monte Carlo
steps, indicating that for most protein positions relaxation occurs with this typical relaxation time. The relaxation
times found from the correlation function analysis are in agreement with our observations in Secs. IV A and IV B. The
long non-Gaussian tail in the histogram of the relaxation times also suggests the presence of conserved positions.
In fact, this tail, composed of the conserved positions, strongly deviates from the rest of the distribution, which is
well approximated by a Gaussian distribution.



D. Rates of amino acid substitutions and conservatism 



A number of authors suggested [47]-[50] that the study of the conserved amino acids in families of structurally similar
proteins can shed light on the functionally, kinetically and thermodynamically important amino acids in proteins.
The basic belief behind the majority of such studies is that evolution optimizes, to a certain extent, the properties
of proteins so that they become more stable and have better folding and functional properties. Here we use the
"optimization" hypothesis of molecular evolution to understand the universe of protein sequences by implication of
molecular evolution. The link between conserved amino acids and their role in proteins has been widely studied.


A recent study of Mirny and Shakhnovich [30] identified the presence of universally conserved amino acids across
the families of proteins sharing the same fold. These conserved residues have been linked to protein stability, kinetic
properties or function. Various experiments [51], [56]-[65] have shown that some of the conserved residues have the predicted
specific roles.

Direct evidence of the relationship between conservatism and the physical properties of amino acids can be obtained
by calculating the rates of amino acid substitutions in the course of the Z-score design. By comparing mutational
rates at various positions of the proteins, we attempt to reconstruct the conservatism of these positions across the
family of analogous proteins. Starting with the sequence of a representative protein of a given fold we perform Z-score
design for $t_d = 10^8$ Monte Carlo steps. The substitution rates are defined as

$$R(k) \equiv \frac{N_m(k)}{\bar{t}_d} = \frac{1}{\bar{t}_d} \sum_{t=1}^{t_d} \left[ 1 - \delta(\sigma_k(t) - \sigma_k(t-1)) \right] , \tag{18}$$

where $N_m(k)$ is the number of mutations that occurred at the position $k$, $\delta(x)$ is a Kronecker function, equal to 1 if
$x = 0$ and 0 otherwise, $\sigma_k(t)$ is an amino acid $\sigma$ at the position $k$ at time $t$, and $\bar{t}_d = t_d/N$ is the average number of
attempted mutations per position in a protein. Thus, $R(k)$ from Eq. (18) is inversely proportional to the average
time between subsequent substitutions of amino acids at the position $k$; the lower the $R(k)$, the longer the amino acid
at the position $k$ remains unchanged and, therefore, the more conservative is this position in the course of design.
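
The substitution rate can be sketched in Python by counting, for each position, the Monte Carlo steps at which the residue changes and dividing by the average number of attempted mutations per position. The sketch assumes that the trajectory stores one sequence per attempted step; the function name and the toy trajectory are hypothetical.

```python
def substitution_rates(trajectory):
    """R(k) for every position k from a list of sequences, one per attempted step."""
    n = len(trajectory[0])
    t_d = len(trajectory) - 1          # number of attempted Monte Carlo steps
    mean_attempts = t_d / n            # average attempts per position
    counts = [0] * n
    for prev, cur in zip(trajectory, trajectory[1:]):
        for k in range(n):
            if prev[k] != cur[k]:      # an accepted substitution at position k
                counts[k] += 1
    return [c / mean_attempts for c in counts]

# Toy trajectory: position 0 mutates twice, position 1 never (illustrative only).
toy_run = ["AA", "BA", "BA", "CA", "CA"]
rates = substitution_rates(toy_run)
```

Positions with low entries in `rates` are the ones that remain unchanged for long stretches of the design and are therefore the conserved ones.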

We find that the rates of substitutions, $R(k)$, correlate with the conservatism patterns, $S_{acr}$ (see Figs. || - [l2| ).
Since there is no obvious relation between $R(k)$ and $S_{acr}$, and, moreover, there is no reason to assume a linear relation
between these quantities, the linear regression has only an illustrative meaning of the correlations observed between
$R(k)$ and $S_{acr}$ (see Figs. || - |l2](b) and Table |). Despite the likely lack of a linear relation between the rates and the
entropy, the correlation observed based on the assumption of linear dependence between $R(k)$ and $S_{acr}$ is feasible.

The computation of mutational rates, $R(k)$, does not involve the tuning of any parameters. We can choose any
non-zero temperature, given that the total number of Monte Carlo steps, $t_d$, in the course of design is large enough
to obtain statistically significant values of $R(k)$. We also find that at $T_{des} = 0.25$ the data for $R(k)$ is identical after
$10^7$ Monte Carlo steps to that after $10^8$ Monte Carlo steps, so the values of $R(k)$ are statistically significant.

Interestingly, the fastest rates are at most two times faster than the slowest rates. Such variability of rates might be
due to the variability in physical properties of amino acids. It has been shown [36] that there are only two principal
eigenvalues of the Miyazawa-Jernigan energy matrix |6?]]68[| and the remaining eigenvalues are close to each other. Such
a "degeneracy" in eigenvalues accounts for the similarities in physical properties of amino acids.

Another possible reason for such a narrow range of rate variability is the absence of side chains in our model. The
side chains are an additional factor that slows down the rates because of the frustrations caused by the multiple side
chain conformations [69] and, possibly, increases the range of rate variability. Despite all the artifacts of our model,
the correlation between $R(k)$ and $S_{acr}$ is significant, which indicates that the model does qualitatively capture the
evolutionary selection of proteins.



E. Profile solution 



The correlation between $S_P(k)$ and $S_{acr}(k)$ is remarkable for all five folds and indicates that our profile
approximation is able to select the conserved amino acids in protein fold families and properly describe the formation of families
on the short time scales (Table ||). It is fully expected that the correlation coefficient is smaller than 1. The reason
for this is that computation of $S_P(k)$ takes into account evolutionary selection for stability only and it does not take
into account possible additional pressure to optimize kinetic or functional properties.

The additional evolutionary pressure due to the kinetic or functional importance of amino acids results in pronounced
deviations of $S_P$ from $S_{acr}$ for a few amino acids that may be kinetically or functionally important. A number of amino
acids whose conservatism is much greater than predicted by our model form a group of "outliers" from otherwise very
close correspondence between $S_P$ and $S_{acr}$. To demonstrate that some of those amino acids are important for folding
kinetics and, as such, can be under additional evolutionary pressure, we color data points on the $S_P$ versus $S_{acr}$
scatter plot according to the range of $\phi$-values [fzO^| that the corresponding amino acids fall into. The thermodynamic
and kinetic roles of individual amino acids were studied extensively (i) by Hamill et al. [71] for the TNfn3 (1TEN)
protein, (ii) by Lopez-Hernandez and Serrano [58] for the chemotactic protein (CheY, pdb:3CHY), and (iii) by Chiti
et al. Q for muscle acylphosphatase (AcP, pdb:2ACY).

We use the $\phi$-values for individual amino acids obtained in [58], [71]. We observe that (i) for TNfn3 protein most of
the points on Fig. 13(b) that belong to the outlier group have $\phi$-values ranging from 0.2 to 1; (ii) for CheY protein
most of the points (for which $\phi$-values are known) on Fig. 15(b) that belong to the outlier group have $\phi$-values ranging
from 0.3 to 1; and (iii) for AcP protein, one nucleus amino acid, Tyr11, is strongly conserved, more than predicted
by the profile solution, while the second amino acid, Pro54, belonging to the nucleus [72] does not appear to be
conserved. The third nucleus amino acid, Phe94, in AcP protein is excluded from our analysis due to the lack of data
at position 94. The discrepancy between the Pro54 conservatism and its kinetic role may be attributed to the poor statistical
significance of the $S_{acr}(k)$ calculation at this position. Figs. 13(b), 15(b), and |l^(b) demonstrate that the presence of
additional evolutionary pressure due to the kinetic importance of amino acids results in stronger conservatism of
specific positions than predicted by the profile solution.

It has been conjectured (see e.g. M) that on average only 3 - 4% of residues are "anchor residues", i.e. those that
are more significantly conserved than the rest of the residues. In fact, this observation is supported by the $S_{acr}(k)$
profile of the sequences and their profile estimates $S_P(k)$ (see Figs. |l^(a) - |l^(a)). These 3 - 4% of "anchor residues"
are the principal "gates" to the structure/kinetics of a given family of proteins. For example, it has been shown
Q that the number of residues that belong to the nucleus of a model protein is about 5%; we expect the same low
percentage of residues that determine the kinetics of real proteins. The number of key residues that form a functional
site is also a small fraction of the total number of residues in proteins.

In order to demonstrate the statistical significance of the outliers' kinetic importance, we show that the number of 
sites with high values of $\phi$ found among the outliers is larger than that expected if such sites were randomly 
distributed across all values of $S_P$. For Tenascin, the total number of residues is $N_{tot} = 89$, the number of sites 
with $\phi > 0.2$ is $N_{tot}(\phi > 0.2) = 17$, the number of outliers is $N_{out} = 13$, and the expected number of sites 
with $\phi > 0.2$ among outliers is $N_{out}^{exp}(\phi > 0.2) = N_{tot}(\phi > 0.2)\,N_{out}/N_{tot} \approx 2.5$. The observed 
number of sites with $\phi > 0.2$ among outliers is $N_{out}(\phi > 0.2) = 8$, which is over three times more than expected. 
A similar estimate for $\phi > 0.5$ gives $N_{out}^{exp}(\phi > 0.5) = 0.75$ and $N_{out}(\phi > 0.5) = 2$, which is nearly three 
times more than expected. For CheY, the total number of residues is $N_{tot} = 128$, the number of sites with $\phi > 0.3$ 
is $N_{tot}(\phi > 0.3) = 11$, the number of outliers is $N_{out} = 22$, and the expected number of sites with $\phi > 0.3$ 
among outliers is $N_{out}^{exp}(\phi > 0.3) = N_{tot}(\phi > 0.3)\,N_{out}/N_{tot} \approx 1.9$. The observed number of sites with 
$\phi > 0.3$ among outliers is $N_{out}(\phi > 0.3) = 4$, which is over two times more than expected. These crude estimates 
demonstrate that outliers have, in fact, a higher than expected number of residues with a pronounced kinetic role, 
hinting at an additional evolutionary pressure exerted on kinetically important amino acids. 
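
The estimate above is simple enough to check by hand, but a short script makes the arithmetic explicit. The following is a minimal sketch, assuming only the counts quoted in the text; the function names are illustrative and not part of the original analysis.

```python
def expected_outlier_hits(n_total, n_high_phi, n_outliers):
    """Expected number of high-phi sites among the outliers if high-phi
    sites were spread uniformly over all positions of the protein."""
    return n_high_phi * n_outliers / n_total


def enrichment(n_total, n_high_phi, n_outliers, n_observed):
    """Ratio of the observed to the expected number of high-phi outliers."""
    return n_observed / expected_outlier_hits(n_total, n_high_phi, n_outliers)


# Counts quoted in the text: Tenascin (phi > 0.2) and CheY (phi > 0.3).
tenascin_expected = expected_outlier_hits(89, 17, 13)
chey_expected = expected_outlier_hits(128, 11, 22)

print("Tenascin:", round(tenascin_expected, 1), "expected,",
      round(enrichment(89, 17, 13, 8), 1), "fold enrichment")
print("CheY:", round(chey_expected, 1), "expected,",
      round(enrichment(128, 11, 22, 4), 1), "fold enrichment")
```

With such small counts the ratio is only a rough indicator, which is consistent with the text calling these estimates crude.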



F. Convergent or divergent evolution? 



It has been a long-standing question [11], [74-76] whether the presently known proteins have evolved from a smaller 
family of prebiotic proteins (the "divergent" evolution scenario) or whether they evolved from ancestors with distant 
homology and, due to thermodynamic, kinetic, and functional pressure exerted by evolution, converged to 
structurally similar proteins (the "convergent" evolution scenario). The model of evolution proposed in this work does 
not rule out either of these scenarios. However, the similarity between the distribution of Hamming distances in the 
family of homologous proteins produced by our model and that taken from nature is striking (Fig. 6), serving as a 
hint in favor of divergent evolution. 

An important argument favoring divergent evolution is that the function of a protein is strongly sensitive to 
the protein structure. So, if a protein were to change its structure in the course of evolution, its functionality 
would be affected (there are, of course, possible exceptions). Functionless genes have little chance of surviving in 
cells, so these proteins would most probably be eliminated. However, Murzin [11] proposed a way for a functionless 
protein to survive: by fusing with another, functional protein and then evolving as a unit into a multifunctional 
protein. One of the most prominent examples is the DNA polymerases, which are composed of similar domains with 
different sequence composition [77-79]. 



If we set aside multi-domain proteins, the fact that there is a limited number of folds (< 1,000 according to Chothia 
or < 7,920 according to Orengo) has been extensively used to favor convergent evolution [13]. Li et al. used the 
designability principle to show, from full enumeration of lattice protein models, that the number of members of a 
fold family depends on the stability gap $\Delta$. This dependence means that many unrelated sequences search in the 
course of evolution for stable conformations, and as soon as they reach the basin of a certain fold with a large 
enough energy gap they stay within that basin. The scenario proposed by Li et al. also explains why various fold 
families are unequally populated: the number of family members depends on the energy gap. The more pronounced 
the energy gap is, the more mutations such a fold can tolerate. Buchler and Goldstein [13] argued that the energy 
gap depends on the number of non-local contacts of a given fold. 

There are several questions about that scenario. First, it is not clear whether nature exploits all possible folds: 
even though there are only about 1,000 folds, it is possible that nature simply does not need more of them. Second, 
Chothia and Gerstein [10] argued that the restriction on the divergence of proteins from one another does not come 
from a stability requirement (which is of course important) but from the separation of the mutated residue from the 
active site. Thus, the extent of sequence divergence is inversely proportional to that of protein function(s). The 
experiments by Gassner et al. and Axe et al. support the arguments of Chothia and Gerstein [10]. In these experiments, 
substitution of several amino acids in the hydrophobic core of the T4 lysozyme conserved, to a certain extent, 
the function and the structure of the protein. Third, we note that there is a limited number of types of chemical 
elements that are part of the ligand structures bound by the active site. Thus, we expect that there are 
groups of evolutionarily unrelated proteins with similar binding sites and structures. In fact, there are examples 
of proteins sharing the same site, also called a "super-site". For example, both the transforming protein p21H-RAS-1 
fragment (pdb:1CTQA) and the chemotactic protein 3CHY have similar binding sites; the root-mean-square deviation 
of one protein from the other is 3.2 Å, while there are only 13 identical residues (i.e., ID < 10%). Interestingly, the 
active site of 1CTQA is centered around Mg$^{2+}$, while the active site of 3CHY is built by residues only (Asp12, Asp13, 
Met17, and Asp57). 

It is possible that evolution follows several paths at the same time and the question of whether evolution is divergent 
or convergent is just ill-posed. To answer this, we still need more evidence to rule out one scenario or another. 



V. CONCLUSION 



To conclude, we present a hierarchical model that attempts to explain sequence conservation caused by the most 
basic and universal evolutionary pressure in proteins: to maintain stability. Using this model, we show that separation 
of basic time scales (that constitute a broad distribution with long tails) in evolution is a plausible scenario for the 
sequence heterogeneity of structurally homologous proteins. The two basic time scales are $\tau_0$ and $\tau \gg \tau_0$: (i) at 
time scales $\tau_0$, most mutations that occur in protein sequences do not alter the protein's thermodynamically and/or 
kinetically important sites and form families of homologous proteins; (ii) at time scales $\tau \gg \tau_0$, mutations occur that 
alter several amino acids at the important sites of the proteins in such a way that the properties of the proteins 
are not compromised. At time scale $\tau$ the family of analogs is formed. Mutational rates, directly computed during 
Z-score design, show agreement with the conservatism profiles of the fold families. 

The profile solution predicts sequence entropy reasonably well for the majority, but not all, of amino acids. The 
amino acids that exhibit considerably higher conservatism than predicted from stability pressure alone are likely to 
be important for function and/or folding. Comparison of the "base-level" stability conservatism $S_P(k)$ with $S_{acr}(k)$, 
the actual conservatism profile of a protein fold, allows one to identify functionally and kinetically important amino 
acid residues and potentially gain specific insights into the folding and function of a protein. 

Analysis of the correlation function confirms (i) the presence of an intrinsic time scale, $\tau_0$, at which designed 
sequences are similar and beyond which they differ strongly, and (ii) the presence of conserved positions in the course 
of Z-score design. The distributions of Hamming distances between sequences reveal "clustering" of similar sequences 
(with low Hd) at short time scales $\tau \sim \tau_0$ and disappearance of similarity at larger time scales $\tau \gg \tau_0$. These 
distributions are in accord with the distributions of Hamming distances in the families of homologs, taken from the 
HSSP database, and of analogs, taken from the FSSP database, respectively. 

The proposed study offers a plausible explanation of the clustering of structurally similar proteins into families of 
homologs and analogs. From the perspective of the proposed view of evolution, the conservative amino acids appear 
as thermodynamically and kinetically important centers, mutations of which result in other (possibly strong) sequence 
modifications to preserve the physical properties of the parental proteins. Such modifications result in a new family 
of homologous proteins. In addition, the proposed model can be utilized to search for the thermodynamically and 
kinetically important amino acids in silico. 

Evolution is an extremely complex phenomenon, driven by numerous factors, such as history, preservation of 
function, folding kinetics and stability of proteins in response to changes in the cell/body environment. It is 
remarkable, however, that our simple model was able to qualitatively capture certain aspects of protein evolution 
without any adjustable parameters (except for the contact definition threshold and the empirical matrix of amino acid 
pairwise interactions). 

In addition, our model offers an additional hint favoring divergent evolution. However, the question of whether 
evolution follows a divergent or convergent path is yet to be resolved. Extensive theoretical, phenomenological and 
experimental effort may bring insight to this puzzle. 



VI. METHODS 



A. Protein model 



We use the $C_\beta$ representation of proteins, in which a pair of amino acids is in contact if their $C_\beta$ atoms ($C_\alpha$ in 
the case of Gly) are within a distance of 7.5 Å. We use the Miyazawa-Jernigan (MJ) [68] matrix of pair potentials to 
represent the interaction between each pair of the 20 amino acids. The total potential energy of the protein can be 
written as follows: 

$$E = \frac{1}{2}\sum_{i \neq j}^{N} U(\sigma_i,\sigma_j)\,\Delta_{ij}, \tag{19}$$



where $N$ is the length of the protein, $\sigma_i$ is the amino acid at position $i = 1, \ldots, N$, $U(\sigma_i,\sigma_j)$ is the 
corresponding element of the MJ matrix of pairwise interactions between amino acids $\sigma_i$ and $\sigma_j$, and $\Delta_{ij}$ is the 
element of the contact matrix, defined to be 1 if a contact between amino acids $i$ and $j$ exists (i.e., the distance 
between these amino acids in the native (ground) state does not exceed 7.5 Å) and 0 if the contact does not exist: 

$$\Delta_{ij} = \begin{cases} 1, & |r_i^{NS} - r_j^{NS}| \le 7.5\,\text{Å}, \\ 0, & |r_i^{NS} - r_j^{NS}| > 7.5\,\text{Å}, \end{cases} \tag{20}$$

where $r_i^{NS}$ is the position of the $i$th residue when the protein is in the native conformation. 
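
The contact matrix and the resulting pairwise contact energy can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in this work: the coordinates and the pair potential below are hypothetical toy values (the MJ matrix is not reproduced here), and the sum runs over pairs $i < j$, which for a symmetric potential equals one half of the sum over $i \neq j$.

```python
import itertools
import math

CONTACT_CUTOFF = 7.5  # angstroms, the contact threshold used in the text


def contact_matrix(coords, cutoff=CONTACT_CUTOFF):
    """Delta_ij = 1 if residues i and j lie within `cutoff`, else 0."""
    n = len(coords)
    delta = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            delta[i][j] = delta[j][i] = 1
    return delta


def contact_energy(sequence, delta, potential):
    """Sum of U(sigma_i, sigma_j) over all contacting pairs i < j."""
    n = len(sequence)
    return sum(
        potential[frozenset((sequence[i], sequence[j]))]
        for i, j in itertools.combinations(range(n), 2)
        if delta[i][j]
    )


# Hypothetical example: four residues on a line, 5 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
sequence = "AVDK"
# Toy pair energies (NOT the MJ values), keyed by unordered residue pair.
toy_potential = {
    frozenset("AV"): -1.0,
    frozenset("VD"): 0.5,
    frozenset("DK"): -2.0,
}

delta = contact_matrix(coords)
energy = contact_energy(sequence, delta, toy_potential)
print(delta)
print(energy)
```

Keying the potential by an unordered pair keeps the lookup symmetric by construction, so the same entry serves for $U(\sigma_i,\sigma_j)$ and $U(\sigma_j,\sigma_i)$.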



B. The 6-letter potential 



Due to the similarities in properties of the 20 types of amino acids, one can classify these amino acids into 6 
distinct groups: aliphatic {AVLIMC}, aromatic {FWYH}, polar {STNQ}, positive {KR}, negative {DE}, and 
special (reflecting their special conformational properties) {GP}. We construct the potential of interaction, 
$U_6(\hat\sigma_i,\hat\sigma_j)$, between the six groups of amino acids, $\hat\sigma$, by computing the average interaction between 
these groups, i.e., 

$$U_6(\hat\sigma_i,\hat\sigma_j) = \frac{1}{N_{\hat\sigma_i} N_{\hat\sigma_j}} \sum_{\sigma_k \in \hat\sigma_i} \sum_{\sigma_l \in \hat\sigma_j} U_{20}(\sigma_k,\sigma_l), \tag{21}$$

where $\sigma$ denotes amino acids in the 20-letter representation and $U_{20}(\sigma_k,\sigma_l)$ is the 20-letter MJ matrix of 
interactions; $\hat\sigma$ denotes amino acids in the 6-letter representation. $N_{\hat\sigma}$ is the number of actual amino acids of 
type $\hat\sigma$, e.g., for the aliphatic group $N_{\hat\sigma} = 6$. The 6-letter interaction potential derived from the MJ 20-letter 
potential is given in Table II. 
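
The group averaging described above can be sketched as follows. The grouping is taken from the text; the 20-letter matrix used in the example is a hypothetical placeholder (it is not the MJ matrix), chosen only so that the averages are easy to verify by hand.

```python
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "positive": "KR",
    "negative": "DE",
    "special": "GP",
}


def six_letter_potential(u20):
    """Average a 20-letter potential u20[(a, b)] over every pair of groups."""
    u6 = {}
    for g1, members1 in GROUPS.items():
        for g2, members2 in GROUPS.items():
            total = sum(u20[(a, b)] for a in members1 for b in members2)
            u6[(g1, g2)] = total / (len(members1) * len(members2))
    return u6


# Placeholder 20-letter matrix (an assumption for illustration only):
# -1 for identical residues, 0 for all other pairs.
letters = "".join(GROUPS.values())
toy_u20 = {(a, b): (-1 if a == b else 0) for a in letters for b in letters}

u6 = six_letter_potential(toy_u20)
print(u6[("positive", "positive")])
print(u6[("aliphatic", "aliphatic")])
```

Because each group pair is divided by the product of the two group sizes, large groups such as the aliphatic one do not dominate the reduced matrix simply by having more members.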



C. The measure of the information content of the sequences 



In both the Z-score model and the profile solution, to study the information content of the sequences, we compute 
the sequence entropy, $S_X(k)$, at each position, $k$, of the sequence, 

$$S_X(k) = -\sum_{\sigma} P_X(\sigma_k)\ln P_X(\sigma_k), \tag{22}$$

where $P_X(\sigma_k)$ is the probability that we observe an amino acid $\sigma_k$ at the $k$th position. Subscript $X = Z$ or $P$ 
denotes the Z-score model or the profile solution, respectively. The summation is taken over all possible values of $\sigma_k$. 

The effect of switching from the 20-letter representation of amino acids to a 6-letter representation on the sequence 
entropy is that the values $S_6(k)$ are typically smaller than the values $S_{20}(k)$. For an $M$-letter alphabet with all 
letters equally represented, i.e., $P_X(\sigma_k) = 1/M$, the entropy is equal to $\ln M$. Thus, we expect that the difference 
between the typical values of $S_{20}(k)$ and $S_6(k)$ is approximately $\ln(20/6) \approx 1.2$. The case when all letters of an 
$M$-letter alphabet are equally represented corresponds to the maximal value of the entropy, i.e., 

$$S_M(k) \le \ln M. \tag{23}$$
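
The position-wise entropy and the effect of reducing the alphabet can be illustrated with a short sketch. The alignment column used below is a made-up example, and the one-character group labels are an arbitrary choice for this illustration.

```python
import math
from collections import Counter

GROUPS = {"l": "AVLIMC", "r": "FWYH", "p": "STNQ", "+": "KR", "-": "DE", "s": "GP"}
SIX_LETTER = {aa: label for label, members in GROUPS.items() for aa in members}


def sequence_entropy(column):
    """S(k) = -sum_sigma P(sigma) ln P(sigma) for one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def to_six_letter(column):
    """Map each residue of a column to its 6-letter group label."""
    return [SIX_LETTER[aa] for aa in column]


# Hypothetical alignment column: six different residues from two groups.
column = "AVLIKR"
s20 = sequence_entropy(column)
s6 = sequence_entropy(to_six_letter(column))
print(s20, s6)

# Upper bounds for uniformly populated alphabets and their difference.
print(math.log(20), math.log(6), math.log(20 / 6))
```

In this toy column the 20-letter entropy equals $\ln 6$ because all six residues differ, while the 6-letter entropy is lower because the residues fall into only two groups.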



D. The entropy of the protein fold families 

Theoretical predictions from statistical-mechanical analysis can be compared with data on real proteins. In order to 
determine conservatism in real proteins, we assume that the space of sequences that fold into the same protein structure 
presents a two-tier system, where homologous sequences are grouped into families and there is no recognizable sequence 
homology between families despite the fact that they fold into closely related structures [30]. 

Using the database of protein families with close sequence similarity (HSSP database [19]), we compute the frequencies 
of amino acids at each position, $k$, of aligned sequences, $P_m(\sigma_k)$, for a given, $m$th, family of proteins. We average 
these frequencies across all $N_s$ families sharing the same fold that are present in the FSSP database: 

$$P_{acr}(\sigma_k) = \frac{1}{N_s}\sum_{m=1}^{N_s} P_m(\sigma_k). \tag{24}$$

Next, we determine the sequence entropy, $S_{acr}(k)$, at each position, $k$, of structurally aligned protein analogs: 

$$S_{acr}(k) = -\sum_{\sigma} P_{acr}(\sigma_k)\ln P_{acr}(\sigma_k). \tag{25}$$
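
A minimal sketch of this two-step averaging, for a single aligned position, is given below. The three families are invented for illustration (no HSSP or FSSP data are read), and the function names are not part of the original analysis.

```python
import math
from collections import Counter, defaultdict


def column_frequencies(column):
    """P_m(sigma_k): amino acid frequencies at one position of one family."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def averaged_frequencies(family_columns):
    """P_acr(sigma_k): family frequencies averaged over all N_s families."""
    n_s = len(family_columns)
    p_acr = defaultdict(float)
    for column in family_columns:
        for aa, p in column_frequencies(column).items():
            p_acr[aa] += p / n_s
    return dict(p_acr)


def entropy(p):
    """S = -sum_sigma p(sigma) ln p(sigma), skipping zero-probability terms."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


# Hypothetical position: conserved within each family, different between them.
families = ["LLLL", "IIII", "VVVV"]
within_family = [entropy(column_frequencies(col)) for col in families]
p_acr = averaged_frequencies(families)
s_acr = entropy(p_acr)
print(within_family, s_acr)
```

Averaging the frequencies per family first gives each family equal weight regardless of how many sequences it contains, so a large family cannot mask variation between families.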

APPENDIX A: DETERMINATION OF CONTACT FREQUENCIES FROM HOMOPOLYMER CONFORMATIONS 

The estimation of frequencies is one of the key ingredients in protein design. An alternative approach to that 
proposed above is to assume that the set of conformational decoys is the set of all possible random coil states of 
a homopolymer collapsed at temperatures below the theta-point temperature, $T < T_\theta$; these are the states that 
decoy random heteropolymers explore at the folding transition temperature. Thus, we can determine the frequencies 
of contacts in an ensemble of random heteropolymers by taking the time average of a contact matrix element $\Delta_{ij}$ 
over the possible conformations of a homopolymer at $T < T_\theta$. 

To compute the frequencies of contacts for a homopolymer of length $N$, we use discrete molecular dynamics 
simulations [73], [85]. We model a homopolymer by $N$ beads on a string with the interaction distances scaled to 
7.5 Å (see [85] for a detailed description of the model and the algorithm). We run the simulation at the temperature 
$T_\theta$ (the $\epsilon$ parameter [85] is set to -1) for $10^7$ time units. After $10^7$ time units of simulation we compute the 
frequency $f_{ij}$ of each of the $N(N-1)/2$ contacts in our homopolymer. 
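
The time average over a trajectory can be sketched as below, assuming the simulation has already produced a list of coordinate snapshots taken at equal time intervals. The two snapshots in the example are hypothetical and are not output of a discrete molecular dynamics run.

```python
import itertools
import math


def contact_frequencies(snapshots, cutoff=7.5):
    """Fraction of snapshots in which each pair (i, j), i < j, is in contact."""
    n = len(snapshots[0])
    counts = {pair: 0 for pair in itertools.combinations(range(n), 2)}
    for coords in snapshots:
        for i, j in counts:
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[(i, j)] += 1
    return {pair: c / len(snapshots) for pair, c in counts.items()}


# Hypothetical trajectory: a three-bead chain in an extended and a bent state.
snapshots = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0)],
]

f = contact_frequencies(snapshots)
print(f)
```

The returned dictionary has one entry per unordered pair, i.e., $N(N-1)/2$ entries for a chain of $N$ beads.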

There are two principal drawbacks of this second method: (1) the probability of occurrence of stable elements of 
structure in homopolymers resembling secondary structure in proteins is so low that the distribution of contact 
lengths, $P(\ell)$, in a homopolymer (not shown) drastically differs from that shown for real proteins in Fig. 3; (2) the 
model of a homopolymer used in the simulations strongly differs (e.g., in flexibility) from real proteins. In fact, the 
problem of building an appropriate model for chain flexibility is so important that small variations in it result in 
kinetics drastically different from realistic kinetics (e.g., the appearance of intermediate states). We find that 
both of these drawbacks make the "homopolymer" approach to estimating the frequencies very inefficient for existing 
protein models, so we omit it in our studies. 

There are two strong advantages of this approach, which make it worthwhile to explore in the future, once 
realistic protein models become available: (i) the possibility of generating a large number of decoy conformations and, 
thus, achieving statistically highly significant contact frequency spectra; (ii) the independence of the produced decoys 
from various database biases. 

TABLE I. The values of the correlation coefficient $r$ for the linear regression of $S_P(k)$ and $R(k)$ versus $S_{acr}(k)$ for 
the Ig, OB, R, α/β-P, and TIM folds, and the corresponding optimal values of the temperature $T = T_{sel}$ for the $S_P(k)$ 
versus $S_{acr}(k)$ linear regression. The last column corresponds to the correlation coefficient for the studied folds at a 
fixed selective temperature $T = 0.25$. To obtain the rates of mutations, $R(k)$, we perform Z-score design of the 
sequences for $t_d = 10^8$ Monte Carlo steps at $T_{des} = 0.25$. 

TABLE II. A 6-letter potential derived (see Section VI B) from the MJ 20-letter potential. The symbols "l", "r", "p", "+", 
"-" and "s" denote the 6 corresponding groups of amino acids: aliphatic {AVLIMC}, aromatic {FWYH}, polar {STNQ}, 
positively charged {KR}, negatively charged {DE}, and special (reflecting their special conformational properties) {GP}. 

FIG. 1. A schematic representation of the evolutionary processes that result in conservation patterns of amino acids. For a 
given family of folds, e.g., Ig in this diagram, there are several alternative minima (3) in the hypothetical free energy 
landscape in the sequence space as a function of the "evolutionary" reaction coordinate (e.g., time). Each of these minima 
is formed by mutations in protein sequences at time scales, $\tau_0$, that do not alter the protein's thermodynamically and/or 
kinetically important sites, forming families of homologous proteins. Transitions from one minimum to another occur at 
time scales $\tau = \tau_0 \exp(\Delta G/T)$. At time scale $\tau$ mutations occur that alter several amino acids at the important sites of 
the proteins in such a way that the protein properties are not compromised. At time scale $\tau$ the family of analogs is 
formed. In three minima we present three families of homologs (1TEN, 1FNF, and 1CFB), each comprised of six homologous 
proteins. We show 10 positions in the aligned proteins: from 18 to 28. It can be observed that at position 4 (marked by 
blocks) in each of the families presented in the diagram amino acids are conserved within each family of homologs, but 
vary between these families. This position corresponds to position 21 in the Ig fold alignment (to 1TEN) and is conserved 
(see Fig. 13(a)). 

FIG. 3. Double-logarithmic plot of the distribution, $P(\ell)$, of contacts of length $\ell = |i - j|$ in the ensemble of 
approximately $10^3$ representative globular proteins in the PDB. The contact between residues positioned $i$th and $j$th along 
the protein chain is defined by Eq. (20) using the $C_\beta$ representation of proteins. The parallel line in the range of 
lengths $20 < \ell < 200$ indicates the power-law behavior of $P(\ell)$ in this region, $P(\ell) \sim \ell^{-1.64}$. The region $5 < \ell < 20$ 
is specific to proteins and has been discussed in detail in [88]. 

FIG. 4. Three-dimensional structures of the representative proteins of the five folds under study (Ig, OB, R, α/β-P, and 
TIM folds): (a) 1TEN, (b) 1MJC, (c) 3CHY, (d) 2ACY, and (e) 2EBN. 

FIG. 5. The temperature dependence of the Z-score obtained after $10^5$ Monte Carlo steps and averaged over 100 design 
"trajectories" for five representative proteins (1MJC, 1TEN, 3CHY, 2ACY, and 2EBN) of the five folds studied. Due to 
the normalization of the contact frequencies extracted from the PDB database, the scale of the values of Z differs from 
that of the correctly normalized Z-score. 

FIG. 6. The histograms of Hamming distances for the 1MJC family between all designed sequences for two design times: (a) 
$t_d = 10^3 \gg \tau_0$ and (b) $t_d = 10^2 \sim \tau_0$ Monte Carlo steps. The histograms of Hamming distances for the 1MJC family of 
actual protein sequences: (c) analogs, taken from the FSSP database, and (d) homologs, taken from the HSSP database. In 
the computation of the histogram in case (b) we omit all sequences with sequence similarity less than ID = 55% to mimic 
sequence collection in the HSSP database. The threshold sequence similarity, ID, is chosen so that the resulting Hamming 
distance histogram resembles the one derived from the actual sequences in the HSSP database (d). All histograms are 
normalized to unit area. 

FIG. 7. (a) Plot of the correlation functions versus time for two positions in 1MJC, Ser1 and Val31, obtained in the course 
of Z-score design of $10^3$ sequences for $10^3$ Monte Carlo steps. On a semilogarithmic scale $C_{k=1}(\tau)$ and $C_{k=31}(\tau)$ are 
straight lines whose slopes correspond to relaxation times $\tau = 143$ and $\tau = 387$ Monte Carlo steps, respectively. (b) The 
histogram of the relaxation times $\tau$ for all positions in 1MJC obtained in the course of Z-score design of $10^3$ sequences 
for $10^3$ Monte Carlo steps. The histogram is well fit by a Gaussian function in the region $100 < \tau < 250$ (solid line). 
The long tail that strongly deviates from the Gaussian distribution (by over seven standard deviations) indicates the 
presence of conserved positions in the course of design. 

FIG. 8. (a) The values $R(k)$ (black line) and $S_{acr}(k)$ (red line) for all positions, $k$, for the Ig-fold. The lower the 
value of $R(k)$, the more conservative the amino acids at that position. (b) The scatter plot of $R(k)$ versus observed 
$S_{acr}(k)$. The linear regression correlation coefficients are shown in Table I. The blue line is the linear regression 
approximation. In both parts (a) and (b), rates are multiplied by the length of the representative protein. 

FIG. 9. (a)-(c) The same as Fig. 8 but for the OB-fold. 

FIG. 10. (a)-(c) The same as Fig. 8 but for the R-fold. 

FIG. 11. (a)-(c) The same as Fig. 8 but for the α/β-P-fold. 

FIG. 12. (a)-(c) The same as Fig. 8 but for the TIM-fold. 

FIG. 13. (a) The values $S_P(k)$ (black line) and $S_{acr}(k)$ (red line) for all positions, $k$, for the Ig-fold. The lower the 
value of $S_P(k)$, the more conservative the amino acids at that position. (b) The scatter plot of predicted $S_P(k)$ versus 
observed $S_{acr}(k)$. The linear regression correlation coefficients are shown in Table I. The blue line is the linear 
regression, which has a slope different from 1 (red line), corresponding to the $S_P(k) = S_{acr}(k)$ relation. (c) The 
histogram of the relative differences between $S_P(k)$ and $S_{acr}(k)$. In (b) we assign colors to data points corresponding 
to amino acids with a specific range of $\phi$-values: red if $0.5 < \phi < 1$, yellow if $0.2 < \phi < 0.5$, magenta if 
$0.1 < \phi < 0.2$, violet if $\phi < 0.1$, and black if $\phi$-values are not determined. 

FIG. 14. (a)-(c) The same as Fig. 13 but for the OB-fold. 

FIG. 15. (a)-(c) The same as Fig. 13 but for the R-fold. In (b) we assign colors to data points corresponding to amino 
acids with a specific range of $\phi$-values: red if $0.3 < \phi < 1$, yellow if $0.1 < \phi < 0.3$, violet if $\phi < 0.1$, and black 
if $\phi$-values are not determined. 

FIG. 16. (a)-(c) The same as Fig. 13 but for the α/β-P-fold. In (b) we color red two of the three nucleus amino acids, 
Tyr11 and Pro54. 

FIG. 17. (a)-(c) The same as Fig. 13 but for the TIM-fold. 